Hate speech can be defined as the use of language to express hatred towards
another party. Twitter is one of the most widely used social media platforms.
In addition to posting their own content, users can respond to the content of
others through comments, and some users, intentionally or unintentionally,
post negative comments. Although hate speech is prohibited by regulation,
negative comments persist. This study builds a classifier of possible hate
speech in Twitter messages using deep learning with a long short-term memory
(LSTM) model. An ensemble of term frequency times inverse document frequency
(TF-IDF) and global vector (GloVe) embeddings reaches 86% accuracy, compared
with 80% for the stand-alone word to vector (Word2Vec) method. These results
indicate that the ensemble method improves accuracy over the stand-alone
method and can improve the performance of a deep learning system compared
with using only one method.

1. INTRODUCTION

Hate speech can be defined as the use of language to express hatred for a group or to humiliate a
person or members of that group [1]. It can also be defined as a deliberate attack directed at a
particular group and motivated by aspects of group identity [2]. In the current technological era, the
internet and social media platforms have become an integral part of social life [3]. Twitter is one of the
social media platforms most widely used by the public. It allows users to share content such as text,
images, and videos, and to view content from other users. Beyond viewing content, users can also
respond to it with comments, which may be positive or negative. Many users use social media to spread
hate speech motivated by personal desires, but some users leave negative comments accidentally or
without understanding the meaning of their message. A writer may not intend to write negatively, yet
the reader can still interpret the writing negatively.

There is debate about banning hate speech. According to Howard [4], this debate should be divided
into several parts. The first concerns the scope of the moral right to freedom of expression. The second
concerns the moral obligation to refrain from hate speech. The third relies on pragmatic concerns
involving epistemic fallibility, abuse of state power, and the benefits of speech against coercion.
Regulations on hate speech on social media now exist. However, it is not uncommon for users to
still post negative comments or hate speech.

This research therefore classifies possible hate speech from text using deep learning with two
layers, namely an embedding layer and a long short-term memory (LSTM) layer. The embedding layer
uses either word to vector (Word2Vec) or an ensemble of term frequency times inverse document
frequency (TF-IDF) and global vector (GloVe). The LSTM was selected because it performs effectively
on a wide range of problems and is now frequently used [5]. Previously, research on emotion detection
using TF-IDF and the LSTM model by Haryadi and Kusuma [6] reported an accuracy of 99.22% using
LSTM and 99.18% using nested LSTM. Nurrohmat and Azhari [7] reported an accuracy of 72.85% for
sentiment analysis of a novel using Word2Vec with LSTM. In research on hate speech detection by
Malik et al. [8], GloVe with the Bi-LSTM model obtained 84% accuracy on the first dataset.

This paper compares the accuracy of hate speech classification between Word2Vec and the
ensemble of TF-IDF and GloVe. The study is expected to be useful in identifying text messages that
may contain hate speech, and thereby to help readers judge the intent of such messages. It is also
hoped that the results of this study can serve as a basis for future research and for the development of
applications that aim to detect hate speech.


2. METHOD 

Figure 1 shows the stages of the hate speech classification method used in this research. In the
preprocessing phase, the data undergoes tokenization, stop word removal, and lemmatization. Once
preprocessing is complete, the next step is embedding. The embedding methods employed are
Word2Vec, TF-IDF, and GloVe. Word2Vec is implemented as a stand-alone model, while TF-IDF and
GloVe are used in an ensemble approach. Subsequently, an LSTM is trained on the embedded data,
and the results are discussed.


Figure 1. Hate speech classification flowchart (box labels recoverable from the figure: Embedding Layer, LSTM Layer)


Based on Figure 1, the method is divided into four steps: i) preprocessing, ii) embedding layers,
iii) LSTM layer, and iv) results. These steps are elaborated in the following sections. The results
include performance metrics such as precision, recall, F1-score, and accuracy. The outcomes are
compared to determine which model exhibits the highest efficiency and accuracy.


2.1. Preprocessing 

Preprocessing consists of three steps: i) tokenization, ii) stop word removal, and iii)
lemmatization. Tokenization is the process of breaking down text into words, phrases, symbols, and
other elements [9], so that the words in a sentence can be examined individually. Stop words are words
that occur frequently but carry little meaning because they only serve to connect other words in a
sentence [9]. For example, "it," "and," and "the" make poor index terms for a document [10]. The next
step is lemmatization. Lemmatization is similar to stemming, which reduces words to their root forms.
However, lemmatization resolves words according to the context in which they are used, because some
words have multiple meanings [11]. In short, this preprocessing step sorts and filters sentences so that
they can be processed by computer algorithms.
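
The three preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the stop word set and the lemma dictionary below are toy assumptions standing in for the NLTK resources, and a dictionary lookup ignores the context that a real lemmatizer takes into account.

```python
import re

# Toy resources for illustration only (assumptions, not the NLTK lists).
STOP_WORDS = {"it", "and", "the", "a", "is", "are", "to", "of"}
LEMMAS = {"debates": "debate", "candidates": "candidate", "was": "be"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its dictionary form when one is known."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Apply tokenization, stop word removal, and lemmatization in order."""
    return lemmatize(remove_stop_words(tokenize(text)))


print(preprocess("The candidates and the debates"))
```

With these toy resources, the example sentence is reduced to the two tokens "candidate" and "debate".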

The dataset was obtained from the Kaggle website and contains tweets about the grand old
party (GOP) Debate in Ohio in early August 2016 [12]. It contains the columns index, id, candidate,
candidate_confidence, relevant_yn, relevant_yn_confidence, sentiment, sentiment_confidence,
subject_matter, subject_matter_confidence, candidate_gold, name, relevant_yn_gold, retweet_count,
sentiment_gold, subject_matter_gold, text, tweet_coord, tweet_created, tweet_id, and tweet_location.
The dataset has 13,871 rows in total. The sentiment column contains 2,236 positive, 8,493 negative,
and 3,142 neutral labels. Because the main objective of this study is to classify positive and negative
data, the neutral rows are deleted, leaving 10,729 rows. Figure 2 shows the number of positive and
negative rows.
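
The filtering step and the resulting class balance can be checked with a short sketch. The label spellings ("Positive", "Negative", "Neutral") and the row layout are assumptions made for illustration; only the three counts come from the text.

```python
from collections import Counter

# Label counts reported in the text for the sentiment column.
reported = {"Positive": 2236, "Negative": 8493, "Neutral": 3142}


def drop_neutral(rows):
    """Keep only rows whose sentiment is not Neutral."""
    return [r for r in rows if r["sentiment"] != "Neutral"]


# Stand-in table with one dict per tweet (tweet text omitted).
rows = [{"sentiment": label} for label, n in reported.items() for _ in range(n)]
kept = drop_neutral(rows)
counts = Counter(r["sentiment"] for r in kept)

print(len(rows), len(kept))
print(round(counts["Negative"] / len(kept), 3))
```

After filtering, roughly 79% of the remaining rows are negative, so the two classes are clearly imbalanced. This matters when reading accuracy figures, because a classifier that always predicts the majority class already scores close to that share.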


Figure 2. Positive and negative dataset diagram (chart title: Dataset labels distribution; categories: Positive, Negative)


2.2. Embedding layer 

Word embedding is a vector representation of words that encodes semantic and syntactic meaning
learned from a large corpus [13]. In this embedding layer, three techniques are used in two
configurations: stand-alone Word2Vec, and an ensemble of the TF-IDF and GloVe models.


2.2.1. Word to vector 

Word2Vec is a neural network that represents words in vector form [14]. It has two models,
namely the continuous bag-of-words (CBOW) model and the continuous skip-gram model. The
CBOW model is used in this study. It combines the surrounding words to predict the middle
word [15]. Figure 3 shows the layers in the CBOW architecture. The projection layer is shared by all
words, so every context word is projected into the same position and the word vectors are
averaged [16]. In this way, the CBOW architecture predicts the current word from its context [17].
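
The CBOW input described above can be sketched as follows. The sketch only shows how context and target pairs are formed and how the context vectors are averaged in the projection layer; it does not train anything. The window size of 2 follows the w(t-2) to w(t+2) layout of the CBOW architecture, and the two-dimensional vectors are made-up values for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


def project(context, vectors):
    """Average the context word vectors (the shared projection layer)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in context:
        for k, value in enumerate(vectors[word]):
            total[k] += value
    return [v / len(context) for v in total]


tokens = ["w1", "w2", "w3", "w4", "w5"]
# Hypothetical 2-dimensional word vectors.
vectors = {
    "w1": [1.0, 0.0],
    "w2": [0.0, 1.0],
    "w3": [0.5, 0.5],
    "w4": [1.0, 1.0],
    "w5": [0.0, 0.0],
}

pairs = cbow_pairs(tokens)
context, target = pairs[2]
print(context, target, project(context, vectors))
```

For the middle token "w3", the context is the two words on each side, and the projection is the element-wise mean of their vectors. In the full model, that averaged vector is what the output layer uses to predict the target word.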


Figure 3. CBOW architecture [15] (layers: Input, Projection, Output; inputs w(t-2), w(t-1), w(t+1), w(t+2); output w(t))


2.2.2. Ensemble method 

The ensemble method improves on the prediction performance of a single model by training many
models and combining their predictions [18]. In this study, a combination of TF-IDF and GloVe is
used, in the expectation that the ensemble gives better results than a single embedding.
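
One common way of combining predictions, in the sense of the general definition above, is soft voting: the predicted probabilities of several models are averaged before a threshold is applied. The sketch below illustrates that idea only. The text does not state how the TF-IDF and GloVe branches are combined in this study, so the function names, the equal weights, and the probability values are all assumptions.

```python
def soft_vote(prob_lists, weights=None):
    """Weighted average of per-model probabilities for the positive class."""
    n_models = len(prob_lists)
    weights = weights or [1.0 / n_models] * n_models
    n_items = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists))
        for i in range(n_items)
    ]


def to_labels(probs, threshold=0.5):
    """Turn probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical outputs of two models for three messages.
model_a = [0.9, 0.4, 0.2]
model_b = [0.7, 0.8, 0.1]

combined = soft_vote([model_a, model_b])
labels = to_labels(combined)
print(combined, labels)
```

In the second message the two models disagree (0.4 versus 0.8), and the averaged probability decides the label. An alternative design is to concatenate the features of both embeddings and train a single classifier on top.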

2.2.2.1. TF-IDF

TF-IDF is a formal measure of how concentrated the occurrences of a given word are in relatively
few documents [19]. In other words, TF-IDF weights the number of times a word appears in a
document by how rare that word is across the collection. This is useful for information retrieval and
text mining.


$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$ (1)


TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF). TF is the
number of times a particular word appears in a document [20]. IDF is calculated from the number of
documents in the collection that contain the term in question [21]; the fewer documents contain the
term, the higher its IDF.
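
Equation (1) can be worked through on a toy corpus. The text does not give the exact IDF formula, so the sketch assumes the common variant IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Libraries such as scikit-learn use smoothed variants, so their values differ from the ones computed here.

```python
import math


def tf(term, doc):
    """Raw count of `term` in a tokenized document."""
    return doc.count(term)


def idf(term, docs):
    """log(N / df), an assumed IDF variant; 0.0 for unseen terms."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    """Equation (1): TF(t, d) multiplied by IDF(t)."""
    return tf(term, doc) * idf(term, docs)


# Toy corpus of three tokenized documents.
docs = [
    ["debate", "was", "great"],
    ["debate", "was", "awful", "awful"],
    ["great", "night"],
]

print(tf_idf("awful", docs[1], docs))   # 2 * log(3 / 1)
print(tf_idf("debate", docs[1], docs))  # 1 * log(3 / 2)
```

The word "awful" appears twice in one document and nowhere else, so it receives a much higher weight than "debate", which appears in two of the three documents.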


2.2.2.2. GloVe 

The last embedding technique, used together with TF-IDF in the ensemble method, is GloVe.
According to the literature review, the GloVe word embedding model achieves high accuracy compared
with other word embedding models such as FastText. GloVe was first introduced in 2014 and remains
popular. It is a model that captures global corpus statistics, which are the main source of information
for learning word representations [22]. In other words, GloVe derives the relationship between words
from how frequently they occur together in a text. With GloVe, machines can use large datasets with
billions of words, which may not otherwise be accessible, to derive statistically robust word
meanings [8]. GloVe is thus a word embedding method whose advantages include capturing global
context, handling proportionality, good interpretability, and reliability in various natural language
processing tasks. Its word representations reflect semantic and syntactic relationships, making it a
popular and effective choice in a variety of natural language processing applications.
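
The global corpus statistic that GloVe is fitted to is a word-word co-occurrence matrix. The sketch below builds such counts for a toy corpus, weighting each pair by the inverse of the distance between the two words, as described in the original GloVe paper. It illustrates only the statistic, not the fitting of the vectors; the window size and the sentences are assumptions made for illustration.

```python
from collections import defaultdict


def cooccurrence(sentences, window=2):
    """Symmetric co-occurrence counts, each pair weighted by 1 / distance."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` tokens and count both directions.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


# Toy tokenized corpus.
sentences = [
    ["hate", "speech", "online"],
    ["hate", "speech", "detection"],
]

X = cooccurrence(sentences)
print(X[("hate", "speech")], X[("hate", "online")])
```

The pair ("hate", "speech") is adjacent in both sentences and therefore gets a larger count than ("hate", "online"), which occurs once at distance two.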


2.3. Long short-term memory layer 

For the classification of hate speech from text, the LSTM model was chosen over transformer-based
architectures. The capacity of the LSTM to capture long-term dependencies in sequential data, the
limits of the available computational resources, the incorporation of ensemble methods, and the need
for interpretability all contributed to this decision, which was made after assessing the nature of the
task, the available resources, and empirical evaluation in this specific context. Transformer-based
architectures come in numerous models tailored to the language used in the text, whereas the LSTM
can be applied regardless of language, and the LSTM continues to be commonly used in contemporary
studies. LSTM is a recurrent neural network (RNN) architecture specifically designed to overcome the
vanishing gradient problem [23]. The main difference is that standard RNNs have problems with inputs
that lie too far in the past. The hidden layer of the LSTM consists of memory blocks, which are
recurrently connected subnets [23]. The LSTM memory block has three gates: i) the input gate ($i_t$),
ii) the forget gate ($f_t$), and iii) the output gate ($o_t$), as shown in Figure 4.


Figure 4. Block memory LSTM [24] 


The input gate controls the flow of input activations into the memory cell. The output gate
controls the flow of output activations to the rest of the network. The forget gate scales the internal
state of the cell before it is fed back as input through the cell's recurrent connection [25]. In this way,
the cell can adaptively forget or reset its memory.


Bulletin of Electr Eng & Inf, Vol. 13, No. 3, June 2024: 1913-1919, ISSN: 2302-9285


Figure 4 [24] is an example of a memory block. Information is carried through the cell state ($C_t$),
which acts as the internal state of the LSTM. Data in the cell state can be discarded or passed on
to the next memory block and processed in the hidden state.


$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ (2)
$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$ (3)


The forget gate ($f_t$) is calculated by combining the current input ($x_t$) and the previous hidden
state ($h_{t-1}$) with weight ($W_f$) and bias ($b_f$), and passing the result through the sigmoid
function ($\sigma$). The sigmoid function has an output range of 0 to 1. The forget gate ($f_t$) is then
multiplied by the previous cell state ($C_{t-1}$), so the forget gate determines how much of the
previous cell state is removed. If the forget gate is 0, the previous cell state is deleted; if it is 1, the
previous cell state is kept in full [24].


$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ (4)
$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ (5)


The input gate ($i_t$) is similar to the forget gate: it combines the current input ($x_t$) and the
previous hidden state ($h_{t-1}$) with weight ($W_i$) and bias ($b_i$), and passes the result through the
sigmoid function ($\sigma$). The candidate input ($\tilde{C}_t$) uses weight ($W_C$) and bias ($b_C$)
with the hyperbolic tangent. The input gate and the candidate input are then multiplied and added to
the cell state ($C_t$). The input gate therefore indicates how much of the input goes into the cell
state [24].


$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (6)
$h_t = o_t \times \tanh(C_t)$ (7)


The output gate ($o_t$) combines the current input ($x_t$) and the previous hidden state ($h_{t-1}$)
with weight ($W_o$) and bias ($b_o$), and passes the result through the sigmoid function ($\sigma$).
The output ($h_t$) is the product of the output gate and the hyperbolic tangent (tanh) of the updated
cell state. The output gate therefore regulates how much of the cell state becomes the hidden
state ($h_t$) [24].
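
Equations (2) to (7) can be traced with a single LSTM time step in plain Python. This is a minimal sketch for following the arithmetic, not the Keras layer used in the experiments. The parameter dictionary, its key names, and the tiny sizes (hidden size 1, input size 1) are assumptions chosen so that the result can be checked by hand.

```python
import math


def sigmoid(z):
    """Logistic function with output in the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following equations (2) to (7)."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]      # (2)
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]      # (4)
    c_hat = [math.tanh(a) for a in affine(p["W_C"], z, p["b_C"])]  # (5)
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]         # (3)
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]      # (6)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]             # (7)
    return h_t, c_t


def zero_params():
    """All-zero weights and biases for hidden size 1 and input size 1."""
    return {
        "W_f": [[0.0, 0.0]], "b_f": [0.0],
        "W_i": [[0.0, 0.0]], "b_i": [0.0],
        "W_C": [[0.0, 0.0]], "b_C": [0.0],
        "W_o": [[0.0, 0.0]], "b_o": [0.0],
    }


# With zero parameters every gate equals sigmoid(0) = 0.5.
h_t, c_t = lstm_step([1.0], [0.0], [2.0], zero_params())
print(h_t, c_t)

# A strongly negative forget bias closes the forget gate.
closed = zero_params()
closed["b_f"] = [-50.0]
h_closed, c_closed = lstm_step([1.0], [0.0], [2.0], closed)
print(c_closed)
```

In the first call, the forget gate of 0.5 halves the previous cell state from 2.0 to 1.0, and the candidate input contributes nothing because tanh(0) is 0. In the second call, the forget gate is close to 0, so the previous cell state is effectively erased.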


3. RESULTS AND DISCUSSION 

The models in this experiment were implemented using Scikit-Learn, Keras, and NLTK. The
dataset consists of public messages posted on Twitter about the GOP Debate in Ohio in early August
2016 [12]. The stand-alone classifier with the Word2Vec embedding uses categorical_crossentropy
as the loss function. Its LSTM model uses 1 LSTM layer with 196 neurons and 1 dense layer, for a total
of 3,333,445 parameters. This model gets 40% precision, 50% recall, and a 44% F1-score on the macro
average, and 64% precision, 80% recall, and a 71% F1-score on the weighted average. The TF-IDF and
GloVe ensemble, which also uses the LSTM model, gets 81% precision, 70% recall, and a 73% F1-score
on the macro average, and 85% precision, 86% recall, and an 84% F1-score on the weighted average.

All macro and weighted average results show that TF-IDF+GloVe outperforms Word2Vec, as does
the accuracy. The ensemble model uses binary_crossentropy as the loss function and 196 neurons in
the LSTM layer. It has 3 dense layers, with 21 units in the first, 21 in the second, and 1 in the third.
The total number of parameters in this model is 1,063,951. The model has an accuracy of 86%, which
is higher than Word2Vec's accuracy of 80%. The two experiments therefore show that the TF-IDF and
GloVe ensemble improves accuracy compared to Word2Vec. The ensemble result is also slightly higher
than the accuracy of 85% obtained by Montalvo [12]. The results are shown in Table 1.


Table 1. Macro and weighted average F1-score performance

                        Macro AVG                      Weighted AVG
Embedding      Model    Precision  Recall  F1-Score    Precision  Recall  F1-Score    Accuracy
Word2Vec       LSTM     0.40       0.50    0.44        0.64       0.80    0.71        0.80
TF-IDF+GloVe   LSTM     0.81       0.70    0.73        0.85       0.86    0.84        0.86
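
The difference between the macro and weighted averages in Table 1 can be reproduced with a short sketch. The macro average is the plain mean of the per-class scores, while the weighted average weights each class by its number of true samples. The label encoding (0 for negative, 1 for positive), the 80/20 test split, and the always-negative classifier below are hypothetical inputs chosen for illustration; they are not taken from the experiments.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def averaged_report(y_true, y_pred, labels=(0, 1)):
    """Macro and support-weighted averages of precision, recall, and F1."""
    rows, supports = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        rows.append(per_class_metrics(tp, fp, fn))
        supports.append(tp + fn)
    total = sum(supports)
    macro = tuple(sum(r[k] for r in rows) / len(rows) for k in range(3))
    weighted = tuple(
        sum(r[k] * s for r, s in zip(rows, supports)) / total for k in range(3)
    )
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / total
    return macro, weighted, accuracy


# Hypothetical split: 80 negative (0) and 20 positive (1) tweets, scored by
# a classifier that always predicts the majority class.
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100

macro, weighted, accuracy = averaged_report(y_true, y_pred)
print([round(v, 2) for v in macro])
print([round(v, 2) for v in weighted])
print(round(accuracy, 2))
```

On this hypothetical split, the majority-class baseline gives macro averages of 0.40, 0.50, and 0.44, weighted averages of 0.64, 0.80, and 0.71, and an accuracy of 0.80. These coincide with the Word2Vec row of Table 1 after rounding. That is an arithmetic observation rather than a statement about the trained model, but it suggests that a per-class report would help in interpreting that row.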

4. CONCLUSION

From this experiment, it can be concluded that the ensemble method improves accuracy over the
stand-alone method. The experimental results show that the accuracy of the ensemble of TF-IDF and
GloVe (86%) is higher than that of a single method, namely Word2Vec (80%). This suggests that
combining several methods can improve a deep learning system and produce better results than using
only one method.

In future work, efforts will be made to further optimize accuracy based on these experimental
results. Other deep learning models will be added, and other algorithms, such as random forest,
logistic regression, and support vector classifier, will be added to the ensemble to look for higher
accuracy. Other features from Word2Vec, TF-IDF, and GloVe will also be explored.